This is the Optimization Demonstration for GBM-BRT

Approach

Unconstrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Find the combination of algorithm inputs that maximizes accuracy. If there are ties, break them by choosing the point that requires the least data.
  3. Find the costs associated with running the algorithm with those inputs on all different hardware configurations.
  4. Find the combination of hardware that jointly minimizes cost and time.

Data-constrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Subset the accuracy surface produced above to the amount of data available. The maximizing point will fall in the upper right corner of the subsetted space.
  3. Find the costs associated with running the algorithm with the accuracy-maximizing point on all different hardware configurations.
  4. Find the combination of hardware that jointly minimizes time and cost.

Cost-constrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Subset the space produced above to the time and money available for modelling.
  3. Working backwards now, find the accuracies that can be produced in the limited time.
  4. Using the subset of accuracy space, find the combination of algorithm inputs that maximizes accuracy.
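The three scenarios differ only in which portion of the candidate space is searched before the same optimization runs. A minimal sketch of that idea, using a toy candidate grid with made-up time and cost formulas (all column names and numbers here are illustrative, not the variables used in the analysis below):

```r
## Toy candidate grid: one row per (algorithm inputs, hardware) combination.
candidates <- expand.grid(
  trainingExamples = c(1000, 5000, 10000),
  covariates = c(2, 5),
  cores = 1:4,
  GBMemory = c(2, 4, 8)
)
## Made-up projections of execution time (seconds) and cost (cents).
candidates$seconds <- with(candidates, trainingExamples * covariates / (50 * cores))
candidates$cents   <- with(candidates, seconds * (cores + GBMemory / 4) / 100)

unconstrained    <- candidates                                   ## search everything
data.constrained <- subset(candidates, trainingExamples <= 2500) ## data budget
cost.constrained <- subset(candidates, seconds <= 20)            ## time budget
```

Each constrained scenario then optimizes over its reduced table exactly as the unconstrained one does over the full table.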
knitr::opts_chunk$set(cache=TRUE, echo=F, warning=F, error = F, message=F)
knitr::opts_knit$set(root.dir = "/users/scottsfarley/documents")
setwd("/users/scottsfarley/documents")
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
library(akima)
library(ggplot2)
options(java.parameters = "-Xmx1500m")
library(bartMachine)
## Loading required package: rJava
## Loading required package: bartMachineJARs
## Loading required package: car
## Warning: replacing previous import 'lme4::sigma' by 'stats::sigma' when
## loading 'pbkrtest'
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Loading required package: missForest
## Loading required package: itertools
## Welcome to bartMachine v1.2.3! You have 1.4GB memory available.
## 
## If you run out of memory, restart R, and use e.g.
## 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
## 'library(bartMachine)'.
bartMachine::set_bart_machine_num_cores(3)
## bartMachine now using 3 cores.
library(reshape2)
library(ggdendro)
threshold.time <- 20 ##seconds
threshold.cost <- Inf ##cents
threshold.numTex <- 2500

First, get the training data and fit the model. Perform some skill checks on it.

## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 6 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## serializing in order to be saved for future R sessions...done
## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 6 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## serializing in order to be saved for future R sessions...done

Choose a finite number of possible solutions to the model. Ideally, we would want every single combination of predictor variables on [0, Inf]. This is obviously intractable. Moreover, I only have data for a subset of that space anyway. So randomly sample the subspace in which I have data to make the problem tractable.

Using that subset of data and the models we fit previously, predict each candidate configuration of algorithm inputs and hardware variables for execution time and SDM accuracy.
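Concretely, the two steps look like this. The real models are bartMachine fits; a stand-in `lm` on synthetic data keeps this sketch self-contained, and the sampling bounds and coefficients are made up:

```r
set.seed(1)
## Randomly sample the subspace in which training data exist.
candidates <- data.frame(
  trainingExamples = sample(500:10000, 200, replace = TRUE),
  covariates       = sample(2:5, 200, replace = TRUE),
  cores            = sample(1:16, 200, replace = TRUE),
  GBMemory         = sample(c(2, 4, 8, 16), 200, replace = TRUE)
)

## Stand-in for the fitted accuracy model: any object with a predict()
## method slots in the same way the bartMachine models do.
toy <- data.frame(trainingExamples = runif(100, 500, 10000),
                  covariates       = sample(2:5, 100, replace = TRUE))
toy$accuracy <- 0.6 + 2e-5 * toy$trainingExamples + 0.02 * toy$covariates +
  rnorm(100, 0, 0.01)
accuracy.model <- lm(accuracy ~ trainingExamples + covariates, data = toy)

## Predict every candidate configuration.
candidates$accuracy <- predict(accuracy.model, newdata = candidates)
```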

Plot the posterior means of the accuracy models against the algorithm inputs that should control accuracy. In this case, these are number of training examples and number of covariates.

The accuracy clearly varies from low (few training examples and few covariates) to very high (many covariates, many training examples). Perhaps more data would be helpful here, but we work with what we have. Our task is to find the combination of inputs that results in the highest-accuracy model. If there’s a tie, find the combination that needs the least data.
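That selection rule can be sketched as follows (toy accuracy surface with a deliberate tie at the maximum; exact equality stands in for whatever tie tolerance is appropriate):

```r
## Toy accuracy surface over the two inputs that control accuracy.
grid <- expand.grid(trainingExamples = c(1000, 5000, 10000),
                    covariates = c(2, 5))
grid$accuracy <- c(0.70, 0.78, 0.81, 0.72, 0.81, 0.81)

## Keep every accuracy-maximizing row...
best <- grid[grid$accuracy == max(grid$accuracy), ]
## ...then break the tie with the least data.
optimal <- best[which.min(best$trainingExamples), ]
```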

Now, we know the combination of algorithm inputs that results in the highest accuracy. The figure below shows the combination identified on the training examples and covariates axes. This combination of training examples and number of covariates can be run on any combination of hardware, some of which might be suboptimal. Thus, at this point, we’ve solved half of our challenge: the algorithm inputs have been optimized, and now it’s time to optimize the hardware.

## [1] "Accuracy is maximized at 10000 training examples and 5 predictors."

In theory, the hardware parameters should not affect the SDM accuracy. We can test this assumption here by plotting the accuracies obtained for this combination of algorithm inputs against modeled accuracy on the number of CPUs and amount of memory. If the assumption is valid, the plot should show no change in either the horizontal or vertical directions. We see that there is, in fact, some change. This is likely due to the experimental design and the lack of a full factorial setup. The effect is relatively minor, so I note it and move on.

## [1] "Accuracy Range on Hardware:  0.0177677638775042"
## [1] "Accuracy Range from Expectation:  0"
## [1] "------"
## [1] "Fixing accuracy at:  0.809766288131944"

Now, fix the algorithm inputs at the accuracy-maximizing point, effectively fixing the expected model accuracy. An algorithm with these inputs can be run on any combination of hardware. Project how long that specific model would take and how much it would cost on each computing type, then plot those on time vs. cost axes.

The optimal solution is the one that balances time and cost equally during the minimization. We use a standardized Euclidean distance, normalizing each dimension by its standard deviation so the two are weighted equally. For each candidate combination of hardware, we calculate the distance between it and the origin of these two axes. We then find the minimum of those distances and call that point the optimum.
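A minimal sketch of that minimization, with made-up time/cost projections (the normalization uses each column's standard deviation, matching the description above; which candidate wins depends entirely on the actual projections):

```r
## Made-up (time, cost) projections of the fixed algorithm inputs per hardware.
hw <- data.frame(cores   = 1:8,
                 seconds = c(5736, 2705, 2696, 2691, 2741, 2788, 4242, 4023),
                 cents   = c(517, 163, 162, 135, 247, 307, 893, 1088))

## Standardize each axis by its standard deviation so time and cost weigh equally...
z <- scale(hw[, c("seconds", "cents")], center = FALSE,
           scale = apply(hw[, c("seconds", "cents")], 2, sd))
## ...then take each candidate's Euclidean distance from the origin.
hw$distance <- sqrt(rowSums(z^2))
optimal <- hw[which.min(hw$distance), ]
```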

Our job is complete. We’ve now optimized both the hardware and software dimensions of the problem.

## [1] "------GBM-BRT OPTIMAL--------"
## [1] "Predicted Optimal Accuracy 0.809766288131944 +/- 0"
## [1] "Predicted Optimal Cost (seconds) 2344.55209272068"
## [1] "Predicted Optimal Cost (cents) 234.924119690612"
## [1] "Cores:  2"
## [1] "Memory: 4"
## [1] "Training Examples: 10000"
## [1] "Covariates: 5"
## [1] "Distance from origin is:  2356.29239643414"

Everything up to this point was done using the mean of the posterior distribution, a choice that simplifies the process but loses information and may lead to over-confidence in the predictions. We can modify our steps to include information from the entire posterior, which may solve this issue.

Instead of projecting just the mean time and mean cost for use in the distance minimization, use the entire set of posterior samples. Calculate the distance metric for each sample in the posterior independently. You’re then left with a density distribution of distances, from which we can infer the minimum value.
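A sketch of that per-sample calculation. Here the posterior draws are simulated with `rnorm`; in the analysis they come from the bartMachine posterior, and the per-axis standardization described earlier is omitted for brevity:

```r
set.seed(42)
n.samples <- 1000
## Simulated posterior draws of execution time for three hardware candidates.
time.draws <- cbind(rnorm(n.samples, 2700, 150),
                    rnorm(n.samples, 2790, 160),
                    rnorm(n.samples, 4240, 300))
## Cost is a fixed linear function of time (toy per-second rates).
cents.per.second <- c(0.06, 0.11, 0.21)
cost.draws <- sweep(time.draws, 2, cents.per.second, `*`)

## Distance from the origin for every posterior sample of every candidate...
dist.draws <- sqrt(time.draws^2 + cost.draws^2)
## ...leaving a distribution of distances per candidate.
dist.summary <- data.frame(distance.mean = colMeans(dist.draws),
                           distance.sd   = apply(dist.draws, 2, sd))
```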

The posteriors fall along a line, since there’s a fixed linear relationship between time and cost.

Now, find the distance metrics for all of those points.

There’s a lot of overlap in this figure, and many points are far away from the optimum. We don’t care about those. Take the few closest to the minimum and look at their distributions.
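One way to pick “the few closest” is to cluster the candidates on their distance summaries and keep the group containing the minimum; this is presumably what the `cluster` column in the tables below reflects (`ggdendro`, loaded above, would plot the tree). A self-contained `hclust` sketch with made-up summaries:

```r
## Made-up distance summaries for candidates near the optimum.
near <- data.frame(distance.mean = c(3532, 3553, 3560, 3562, 3615, 3680, 3901, 3910),
                   distance.sd   = c(2652, 2683, 2686, 2687, 2703, 2754, 2938, 3099))
## Hierarchical clustering on the standardized (mean, sd) pairs...
hc <- hclust(dist(scale(near)))
## ...cut into two groups; the group holding the smallest mean is the short list.
near$cluster <- cutree(hc, k = 2)
shortlist <- near[near$cluster == near$cluster[which.min(near$distance.mean)], ]
```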

Now, the optimal configuration may be one of the following:

config  cores  GBMemory   seconds       cost  distance.mean  distance.sd
    13      2         2  2704.995   162.6243       3532.153     2652.495
     2      1         4  2691.353   134.8368       3553.190     2683.442
     3      1         5  2696.454   162.1108       3559.869     2685.519
     4      1         6  2696.454   189.1293       3562.183     2687.265
     5      1         8  2740.659   247.1527       3614.674     2702.529
     6      1        10  2787.623   307.2518       3679.959     2754.004
     7      1        12  2787.623   363.1158       3688.709     2760.553
     8      1        14  2787.617   418.9789       3698.765     2767.967
     9      1        16  2918.713   497.1735       3900.969     2937.501
     1      1         2  2766.460    83.1598       3909.511     3098.745
    10      1        18  2946.735   560.9993       3937.390     2949.387
    11      1        20  2949.412   620.6152       3954.854     2957.253
    12      1        22  2949.489   679.7393       3971.748     2969.828
    97      9         2  4022.547  1088.2599       6187.515     5949.073
   109     10         2  4022.547  1209.1777       6236.811     5996.470
    73      7         2  4242.148   892.6327       6410.895     6147.567
    85      8         2  4242.148  1020.1517       6452.366     6187.334
   121     11         2  4200.457  1388.9230       6547.650     6282.736
   133     12         2  4200.457  1515.1888       6608.699     6341.315
   145     13         2  4248.445  1660.2075       6688.577     6326.659
   157     14         2  4248.445  1787.9158       6758.989     6393.261
   169     15         2  4248.445  1915.6241       6833.809     6464.032
    49      5         2  5311.940   798.3847       7302.654     6074.351
    61      6         2  5311.940   958.0616       7338.059     6103.801
    25      3         2  5736.082   517.2799       7810.075     6472.108

config  cores  GBMemory   seconds      cost  distance.mean  distance.sd  cluster
    13      2         2  2704.995  162.6243       3532.153     2652.495        1
     2      1         4  2691.353  134.8368       3553.190     2683.442        1
     3      1         5  2696.454  162.1108       3559.869     2685.519        1
     4      1         6  2696.454  189.1293       3562.183     2687.265        1
     5      1         8  2740.659  247.1527       3614.674     2702.529        1
     6      1        10  2787.623  307.2518       3679.959     2754.004        1
     7      1        12  2787.623  363.1158       3688.709     2760.553        1
     8      1        14  2787.617  418.9789       3698.765     2767.967        1
     9      1        16  2918.713  497.1735       3900.969     2937.501        1
     1      1         2  2766.460   83.1598       3909.511     3098.745        1
    10      1        18  2946.735  560.9993       3937.390     2949.387        1
    11      1        20  2949.412  620.6152       3954.854     2957.253        1
    12      1        22  2949.489  679.7393       3971.748     2969.828        1

In the results above, you’re actually seeing the trade-off between time and money play out quite nicely. Adding cores costs money but, in the case of GBM-BRTs, reduces time. Here, that trade-off almost exactly evens out.

Data Constraint

In this case, we’ve got a constraint on the amount of data available to us.

## [1] "Current data threshold is  2500"
## [1] "Accuracy is maximized at 2499 training examples and 5 predictors."
## [1] "Expected Max Accuracy is  0.796017942673812"
## [1] "Now there are only:  287 candidates, instead of  287 candidates that can complete this scenario under budget."

## [1] "Recommended # cores:  5"
## [1] "Recommended Memory:  4"
## [1] "Expected Cost:  85.9476676008638"
## [1] "Expected Seconds:  343.104461480494"

config  cores  GBMemory   seconds       cost  distance.mean  distance.sd
    50      5         4  343.1045   85.94767       354.0827     16.43787
    98      9         4  324.7989  146.45182       356.8586     20.25666
    62      6         4  343.1045  103.13720       358.6528     16.65003
    51      5         5  343.7686  103.33685       359.3482     16.69296
    74      7         4  342.7597  120.20583       363.7228     19.03662
   110     10         4  324.7989  162.72425       363.8614     20.65418
     1      1         2  363.7366   10.93392       364.5455     21.78204
    86      8         4  342.7597  137.37809       369.7697     19.35311
    99      9         5  325.4276  176.08239       370.6009     21.01109
    97      9         2  517.0649  139.88675       588.3736    263.14650
   109     10         2  517.0649  155.42972       593.0612    265.24300
    73      7         2  545.3538  114.75334       611.1815    271.76831
    85      8         2  545.3538  131.14667       615.1351    273.52631
   121     11         2  539.9383  178.53600       624.2592    278.15654
   133     12         2  539.9383  194.76654       630.0796    280.75002

config  cores  GBMemory   seconds       cost  distance.mean  distance.sd  cluster
    50      5         4  343.1045   85.94767       354.0827     16.43787        1
    62      6         4  343.1045  103.13720       358.6528     16.65003        1
    51      5         5  343.7686  103.33685       359.3482     16.69296        1
    74      7         4  342.7597  120.20583       363.7228     19.03662        1

Cost Constraint II

config  cores  GBMemory   seconds        cost  distance.mean  distance.sd
    73      7         2  19.95224   4.1983507       2.778164    0.6558623
    85      8         2  19.95224   4.7981150       2.796136    0.6601049
     1      1         2  19.95433   0.5998271       2.830239    0.4430998
   121     11         2  19.94142   6.5938310       2.851433    0.6795293
   133     12         2  19.94142   7.1932702       2.878019    0.6858651
   145     13         2  19.95729   7.7989079       2.914187    0.6919359
   157     14         2  19.95729   8.3988239       2.944866    0.6992201
   169     15         2  19.95729   8.9987399       2.977464    0.7069602
   181     16         2  19.95729   9.5986559       3.011921    0.7151414
    97      9         2  19.97995   5.4053765       3.036575    0.6472420
   193     17         2  19.95729  10.1985719       3.048172    0.7237489
   109     10         2  19.97995   6.0059739       3.060768    0.6523986
   204     17        22  19.99934  78.3538019       3.086156    0.7327676
   216     18        22  19.99934  82.9628491       3.125808    0.7421825
   228     19        22  19.99934  87.5718962       3.167066    0.7519787

config  cores  GBMemory   seconds      cost  distance.mean  distance.sd  cluster
    73      7         2  19.95224  4.198351       2.778164    0.6558623        1
    85      8         2  19.95224  4.798115       2.796136    0.6601049        1
   121     11         2  19.94142  6.593831       2.851433    0.6795293        1
   133     12         2  19.94142  7.193270       2.878019    0.6858651        1